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Abstract — Dual descent methods are commonly used to solve 
network optimization problems because their implementation 
can be distributed through the network. However, their con- 
vergence rates are typically very slow. This paper introduces a 
family of dual descent algorithms that use approximate Newton 
directions to accelerate the convergence rate of conventional dual 
descent. These approximate directions can be computed using 
local information exchanges thereby retaining the benefits of dis- 
tributed implementations. The approximate Newton directions 
are obtained through matrix splitting techniques and sparse 
Taylor approximations of the inverse Hessian. We show that, sim- 
ilarly to conventional Newton methods, the proposed algorithm 
exhibits superlinear convergence within a neighborhood of the 
optimal value. Numerical analysis corroborates that convergence 
times are between one to two orders of magnitude faster than 
existing distributed optimization methods. A connection with 
recent developments that use consensus iterations to compute 
approximate Newton directions is also presented. 

I. Introduction 

Conventional approaches to network optimization are based 
on subgradient descent in either the primal or dual domain; 
see, e.g., 0, OH, E2, flS). For many classes of prob- 
lems, subgradient descent algorithms yield iterations that 
can be implemented through distributed updates based on 
local information exchanges. However, practical applicability 
of the resulting algorithms is limited by exceedingly slow 
convergence rates. To overcome this limitation second order 
Newton methods could be used, but this would require the 
computation of Newton steps which cannot be accomplished 
through local information exchanges. This issue is solved in 
this paper through the introduction of a family of approxima- 
tions to the Newton step. 

The particular problem we consider is the network flow 
problem. Network connectivity is modeled as a directed graph 
and the goal of the network is to support a single information 
flow specified by incoming rates at an arbitrary number of 
sources and outgoing rates at an arbitrary number of sinks. 
Each edge of the network is associated with a concave 
function that determines the cost of traversing that edge as 
a function of flow units transmitted across the link. Our 
objective is to find the optimal flows over all links. Optimal 
flows can be found by solving a concave optimization problem 
with linear equality constraints (Section II). In particular, 
the use of subgradient descent in the dual domain allows 
the development of a distributed iterative algorithm. In this 
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distributed implementation nodes keep track of variables 
associated with their outgoing edges and undertake updates 
based on their local variables and variables available at 
adjacent nodes (Section II-A). Distributed implementation is 
appealing because it avoids the cost and fragility of collecting 
all information at a centralized location. However, due to 
low convergence rates of subgradient descent algorithms, 
the number of iterations necessary to find optimal flows is 
typically very large [13|, |15|. The natural alternative is the 
use of second order Newton's methods, but they cannot be 
implemented in a distributed manner (Section II-B). 

Indeed, implementation of Newton's method necessitates 
computation of the inverse of the dual Hessian and a dis- 
tributed implementation would require each node to have 
access to a corresponding row. It is not difficult to see that the 
dual Hessian is in fact a weighted version of the network's 
Laplacian and that as a consequence its rows could be locally 
computed through information exchanges with neighboring 
nodes. Its inversion, however, requires global information. 
Our insight is to consider a Taylor's expansion of the inverse 
Hessian, which, being a polynomial with the Hessian matrix 
as variable, can be implemented through local information 
exchanges. More precisely, considering only the zeroth order 
term in the Taylor's expansion yields an approximation to 
the Hessian inverse based on local information only - which, 
incidentally, coincides with the method of Hessian diagonal 
inverses proposed in [1]. The first order approximation ne- 
cessitates information available at neighboring nodes and in 
general, the Nth order approximation necessitates information 
from nodes located A*" hops away (Section III). The resultant 
family of algorithms, denoted ADD-N permits a tradeoff 
between accurate Hessian approximation and communication 
cost. Despite the fact that the proposed distributed algo- 
rithms rely on approximate Newton directions, we show that 
they exhibit local quadratic convergence as their centralized 
counterparts (Section IV). An approximate backtracking line 
search is added to the basic algorithm to ensure global 
convergence (Section IV-B). 

Newton-type methods for distributed network optimization 
have been recently proposed in (TJ, (9), fl9l . While specifics 
differ, these papers rely on consensus iterations to compute 
approximate Newton directions. Quite surprisingly, it is pos- 
sible to show that the methods in fl], 0, (19) and the ADD 
algorithm proposed here are equivalent under some conditions 
(Section V). Numerical experiments study the communication 
cost of ADD relative to Q), (9), [19] and to conventional 
subgradient descent. ADD reduces this cost by one order of 
magnitude with respect to Q], J9), 1191 and by two orders of 
magnitude with respect to subgradient descent (Section VI). 



II. The Network Optimization Problem 

Consider a network represented by a directed graph Q — 
(TV, £) with node set TV = {l,...,n}, and edge set £ = 
{1, ...,25}. The zth component of vector x is denoted as 
x % . The notation x > means that all components x % > 
0. The network is deployed to support a single information 
flow specified by incoming rates b % > at source nodes and 
outgoing rates b % < at sink nodes. Rate requirements are 
collected in a vector b, which to ensure problem feasibility 
has to satisfy J27=i ^ = O ur S oa l is t0 determine a flow 
vector x — [x e ) ee £, with x e denoting the amount of flow on 
edge e = We require the some additional notation; the 

transpose of vector x is x', the inner product of x and y is 
x'y, and the Euclidean norm of x is ||x|| := \x'x. 

Flow conservation implies that it must be Ax = b, with A 
the n x E node-edge incidence matrix defined as 

{1 if edge j leaves node i, 
— 1 if edge j enters node i, 
otherwise. 

The element in the ith row and jth column of a matrix A 
is written as The transpose of A is denoted as A'. 

We define the reward as the negative of scalar cost function 
e (x e ) denoting the cost of x e units of flow traversing edge e. 
We assume that the cost functions <f) e are strictly convex and 
twice continuously differentiable. The max reward network 
optimization problem is then defined as 

E 

maximize — f(x) — — <fi e (x e ), subject to: Ax = b. 

e=l 

(1) 

Our goal is to investigate Newton-type iterative distributed 
methods for solving the optimization problem in ([TJ. Before 
doing that, let us discuss the workhorse distributed solution 
based on dual subgradient descent (Section II-A) and the 
conventional centralized Newton's method (Section II-B). 

A. Dual Subgradient Method 

Dual subgradient descent solves ([T} by descending in the 
dual domain. Start then by defining the Lagrangian function 
of problem |IJ as C(x,l) = — J2f=i 4>e{x e ) + 1' (Ax — b) and 
the dual function q(l) as 

q(l) = sup £(>,0= sup I — 4>e{x e ) + I' Ax ] — I'b 

E 

= Yu su p ( - + ( l ' A T xe ) - l '*>, (2) 

where in the last equality we wrote V Ax = ^2f =1 (l' A) e x e 
and exchanged the order of the sum and supremum operators. 

It can be seen from |2| that the evaluation of the dual 
function q(l) decomposes into E one-dimensional optimiza- 
tion problems that appear in the sum. We assume that each 
of these problems has an optimal solution, which is unique 
because of the strict convexity of the functions <f) e . Denote 



this unique solution as x e (l) and use the first order optimality 
conditions for these problems in order to write 

x e (l) = (4>' e )- 1 (l i -l j ), (3) 

where i e TV and j e TV respectively denote the source and 
destination nodes of edge e = As per ^ the evaluation 

of x e (l) for each node e is based on local information about 
the edge cost function <j> e and the dual variables of the incident 
nodes i and j. 

The dual problem of ([!]) is defined as min/ e R>. q(l). The 
dual function is convex, because all dual functions of min- 
imization problems are, and differentiable, because the <f> e 
functions are strictly convex. Therefore, the dual problem can 
be solved using gradient descent. Consider an iteration index 
k, an arbitrary initial vector Iq and define iterates generated 
by the following: 

h+i =h~ ctk9k for all k > 0, (4) 

where = g(lk) = Vq(Zfc) denotes the gradient of the dual 
function q(l) at I = Ik- A first important observation here is 
that we can compute the gradient as gk = Ax (Ik) — b with 
the vector x(lk) having components x e (lk) as determined by 
^ with I = l k ,[2, Section 6.4]. Differentiability of g(\) 
follows from strict convexity of (QJ. A second important 
observation is that because of the sparsity pattern of the node- 
edge incidence matrix A the ith element g\ of the gradient 
gk can be computed as 

gl= J2 * e (^)- E ^Vti-bi (5) 

e=(i,j) e=(j.i) 

The algorithm in Q lends itself to distributed implementation. 
Each node i maintains information about its dual iterates 
l\ and primal iterates x e (lk) of outgoing edges e = (i,j). 
Gradient components g\ are evaluated as per (|5jl using local 
primal iterates x e (lk) for e = (i,j) and primal iterates of 
neighboring nodes x e (lk) for e = Dual variables are 

then updated as per Q. Having updated the dual iterates, we 
proceed to update primal variables as per ([3]). This update 
necessitates local multipliers /[ and neighboring multipliers 
l j 

Distributed implementation is appealing because it avoids 
the cost and fragility of collecting all information at a cen- 
tralized location. However, practical applicability of gradient 
descent algorithms is hindered by slow convergence rates; 
see e.g., lfT3l . |[T5l . This motivates consideration of Newton's 
method which we describe next. 

B. Newton's Method for Dual Descent 

Newton's Method is a descent algorithm along a scaled 
version of the gradient. In lieu of Q iterates are given by 

h+i =h + a k d k for all k > 0, (6) 

where dk is the Newton direction at iteration k and is a 
properly selected step size. The Newton direction, dk satisfies 

H k d k = -gk, (7) 



where Hf. = H(lf.) = V 2 q(l k ) is the Hessian of the dual 
function at the current iterate and g k = g k (lk) is, we recall, 
the corresponding gradient. 

To obtain an expression for the dual Hessian, consider given 
dual l k and primal x k — x(\ k ) variables, and consider the 
second order approximation of the primal objective centered 
at the current primal iterates Xk, 

f(y) = f(x k )+V 'f(x k y ' {y~x k )+\{y~Xk)'V 2 f(xk){y-xk) 

(8) 

The primal optimization problem in ([T]i is now replaced by 
the maximization of the approximating function —f(y) in dsb 
subject to the constraint Ay = b. This approximated problem 
is a quadratic program whose dual is the (also quadratic) 
program 

min 5 (A fe ) = min h' k AV 2 fix^A' X k + p'X k + r. (9) 

The vector p and the constant r can be expressed in closed 
form as functions of Vf(xk) and V 2 f(x k ), but they are 
irrelevant for the discussion here. The important consequence 
of Q is that the dual Hessian is given by 

H k = Q = A{-V 2 f{x k )- 1 )A'. (10) 

From the definition of f(x) in Q it follows that the primal 
Hessian -V 2 /(x/ £ ) is a diagonal matrix, which is negative 
definite by strict convexity of f(x). Therefore, its inverse 
exists and can be computed locally. Further observe that 
the dual Hessian, being the product of the incidence matrix 
times a positive definite diagonal matrix times the incidence 
matrix transpose, is a weighted version of the network graph's 
Laplacian. As a particular consequence it follows that 1 
is an eigenvector of Hk associated with eigenvalue and 
that Hk is invertible on the the subspace 1^. Since the 
gradient g k lies in l 1 - we can find the Newton step as 
d k = —H k g k . However, computation of the pseudoinverse 
H k requires global information. We are therefore interested 
in approximations of the Newton direction requiring local 
information only. 

III. Approximate Newton's Method 

To define an approximate Newton direction, i.e., one for 
which <J7J is approximately true, we will consider a finite 
number of terms of a suitable Taylor's expansion repre- 
sentation of the Newton direction. At iteration k, split the 
Hessian into diagonal elements D k and off diagonal ele- 
ments Bk and write Hk = D k — B k . Further rewrite the 
Hessian as H k = D k 2 (^1 — D 

implies that the Hessian pseudo-inverse is given by HZ = 

-t 



l B k D k 



D k 2 , which 



D, 



D, 



Notice now that for the 



central term of this product we can use the Taylor's expansion 
identity (/ — X)^v = (T^Zq X*) v, which is valid for any 
vector v orthogonal to the eigenvectors of X associated with 
eigenvalue 1. Since g k is orthogonal to 1, it follows 



d k = -H\g k 



D, 



■9k- 



The Newton direction is now represented as an infinite sum 
we can define a family of approximations characterized by 
truncations of this sum, 



N 



i=0 



9k, 
(11) 

where we have defined the approximate Hessian pseudo 

inverse fff> := (p^ B k D~ *)' . The 

approximate Newton algorithm is obtained by replacing 



-H 



the Newton step d k in \6\ by its approximations d k = 



[fc ' g k . The resultant algorithm is characterized by the 
iteration 



h+i — h — a kH k N g k 



(12) 



While not obvious, the choice of N in (jTTJ dictates how 
much information node i needs from the network in order to 
compute the ith element of the approximate Newton direction 
d k N ^ - recall that node i is associated with dual variable l k . 

For the zeroth order approximation d k °^ only the first term 
of the sum in (JTTJ is considered and it therefore suffices 
to have access to the information in D k to compute the 
approximate Newton step. Notice that the approximation 

D k 1 g k implying that we 



(n) 

k 



in this case reduces to d 

approximate H^ 1 by the inverse diagonals which coincides 
with the method in (TJ. 



The first order approximation d k 

7« 



(1) 



uses the first two terms 



DZ 1 B k D k 1 ) g k . 



of the sum in yielding d k L> — (D k . . . . 
The key observation here is that the sparsity pattern 
of B k , and as a consequence the sparsity pattern of 
D k ~ 1 B k D k ~ 1 , is that of the graph Laplacian, which means 
that [DZ BkD7 ]ij ^ if and only if i and j correspond 
to an edge in the graph, i.e, E E. As a consequence, 

to compute the ith element of node i needs to collect 
information that is either locally available or available at 
nodes that share an edge with i. 

(2) 

For the second order approximation d y k ' we add the term 
(D k 1 B k ) D k x to the approximation d k . The sparsity pat- 



tern of (JD, l Bk) D k 1 is that of B k , which is easy to realize 
has nonzero entries matching the 2-hop neighborhoods of 
each node. Therefore, to compute the ith element of d k 
node i requires access to information from neighboring nodes 
and from neighbors of these neighbors. In general, the iVth 
order approximation adds a term of the form (D k 1 B k ) D^ 1 



= to the N — 1st order approximation. The sparsity pattern of 
this term is that of B k , which coincides with the A^-hop 



neighborhood, and computation of the local elements of the 
Newton step necessitates information from N hops away. 



i=0 



We thus interpret (Hi as a family of approximations in- 
dexed by N that yields Hessian approximations requiring in- 
formation from A-hop neighbors in the network. This family 
of methods offers a trade off between communication cost and 
precision of the Newton direction. We analyze convergence 
properties of these methods in the coming sections. 



A. Convergence 

A basic guarantee for any iterative optimization algorithm 
is to show that it eventually approaches a neighborhood of the 
optimal solution. This is not immediate for ADD as defined 
by |l2| i because the errors in the H k N ^ approximations to H"| 
may be significant. Notwithstanding, it is possible to prove 
that the H k N ^ approximations are positive definite for all 
N and from there to conclude that the l k iterates in ( fT~2| > 
eventually approach a neighborhood of the optimal I*. This 
claim is stated and proved in the next proposition. 

Proposition 1. Let I* denote the optimal argument of the dual 
function q{l) of the optimization problem in (|7J and consider 
the ADD-N algorithm characterized by iteration ( 12 \ with 



Hu as in [ID. Assume ct k = a for all k and that the 



network graph is not bipartite. Then, for all sufficiently small 
a, 

lim l k = I* (13) 

k— J-oo 

Proof: As per the Descent Lemma, to prove convergence 
of the lk iterates in ( fl2"] ) it suffices to show that the matrix 
Si is positive definite for all k [2, Proposition A.24]. 

To do so, begin by recalling that the dual Hessian, H k 
is a weighted Laplacian and define the normalized Laplacian 
Lk = D, 2 H k D k 2 , having unit diagonal elements. Applying 
the splitting H k = D k -B k it follows D L ]_B k D k 2 = I-L k 



from which we can write h[ N ^ as [cf. d 1 1 b ] 



g 



N 



H 



(JV) 



EC- 

\i=0 



d: 



(14) 



Focus now on the sum 2~2iLo — ^ k )- non -bip ar tite 
graphs, normalized Laplacians have eigenvalues in the interval 
[0,2), |6, Lemma 1.7]. Therefore, it follows that the eigen- 
values of I—L k , fall in (—1,1]. Furthermore, the normalized 
Laplacian L k has exactly one eigenvector, i/q associated with 
the eigenvalue 0. Observe that since L k v$ = 0, v$ satisfies 



4 



u Q = (N + l)i/ u > 0. 



The other eigenvectors of the normalized Laplacian necessar- 
ily lie in (—1, 1). Suppose /i e ( — 1, 1) is one such eigenvalue 
of I— L k , associated with eigenvector v. It follows that v is an 
eigenvector of 2^2iLo(I ~ L k ) 1 , whose associated eigenvalue 
is v is (1 — p N+1 )/(l — p) > as follows from the sum 
of the truncated geometric series. Since the latter value is 
positive for any p € ( — 1, 1) it follows that ^^ (7 — L k ) 1 is 
positive definite. Further observe that it is also symmetric 
by definition so we can define 2^2iLn(I ~ ^kT — C'C 
where C is square and full rank, Q Theorem 7.2.10]. We 
then construct C = CD k 2 which is also square and full 
rank. This gives us a new symmetric positive definite matrix 



Hi , thus completing 



the proof. ■ 
By continuity of convergence of the dual variable to 
an error neighborhood implies convergence of the primal 



variables to an error neighborhood. Requiring the graph to 
not be bipartite is a technical condition to avoid instabilities 
created by Laplacian eigenvalues — 1. The restriction is not 
significant in practice. 

IV. Convergence Rate 

The basic guarantee in Proposition [T] is not stronger than 
convergence results for regular gradient descent. Our goal is 



to show that the approximate Newton method in ( 12 1 exhibits 



quadratic convergence in a sense similar to centralized (exact) 
Newton algorithms. Specifically, we will show that selecting 
N large enough, it is possible to find a neighborhood of I* 
such that if the iteration is started within that neighborhood 
iterates converge quadratically. 

Before introducing this result let us define the Newton 
approximation error e k as 

e k = H k d k N) + g k . (15) 



We further introduce the following standard assumptions to 
bound the rate of change in the Hessian of the dual function. 

Assumption 1. The Hessian H(l) of the dual function q(l) 
satisfies the following conditions 

(Lipschitz dual Hessian) There exists some constant L > 
such that \\H(l) - H(l)\\ < L\\l - l\\ \fl,l £ W n . 

(Strictly convex dual function) There exists some constant 
M > such that ||-ff(0 _1 || < M V/ G E™ . 

As is usual in second order optimization methods we use 
the gradient norm \\g k \\ = ||<7(£fc)|| to measure the progress 
of the algorithm. The aforementioned quadratic convergence 
result establishes that for any graph we can always select N 
large enough so that if an iterate l k is sufficiently close to the 
optimal I*, the gradient norm ||.gfc+ m || of subsequent iterates 
l k + m decays like 2 2 . This is formally stated in the following. 

Proposition 2. C ons ider ADD-N algorithms characterized 



by the iteration {12\ with Hi as defined in (11 1. Let 



Assumption [7] hold and further assume that the step size is 
a k + m — 1 for all k > m. Let e be a uniform bound in the 



norm of the Newton approximation error e k in (751 so that 
\\ e k\\ < tfor all k. Define the constant 

B = e + M 2 Le 2 . (16) 



Further assume that at time k it holds \\g k \\ < 1/(2M 2 L) 
and that N is chosen large enough to ensure that for some 
8 e (0, 1/2), B+M 2 LB 2 < S/(4M 2 L). Then, for all m > 1, 



1.9 AH 



< 



1 



B 



S (2 2 



1) 



2 2m M 2 L M 2 L 
In particular, as m — > oo it holds 



2 2 " 



lim sup ||3fc H 



< B 



2APL ' 



(17) 



(18) 



Proposition [2] has the same structure of local convergence 
results for Newton's method [4, Section 9.5]. In particular, 
quadratic convergence follows from the term 1 / (2 2 ) in ( jl7| . 
The remaining terms in (17i are small constants that account 
for the error in the approximation of the Newton step. 



Notice that Proposition [2] assumes that at some point in 
the algorithm's progression, ||^|| < 1/{2M 2 L). Quadratic 
convergence is only guaranteed for subsequent iterates l k + m . 
This is not a drawback of ADD, but a characteristic of all 
second order descent algorithms. To ensure that some iterate 
l k does come close to I* so that \\gk\\ < 1/(2M 2 L) we use 
a distributed adaptation of backtracking line search (Section 
[TVjAjl. 

To proceed with the proof of Proposition [2] we need two 
preliminary results. The first result concerns the bound e 
which was required to hold uniformly for all iteration indexes 
k. While it is clear that increasing N reduces ||efe||, it is not 
immediate that a uniform bound should exist. The fact that a 
uniform bound does exists is claimed in the following lemma. 

Lemma 1. Given an arbitrary e > 0, there exists an N such 
that the Newton approximation errors e k as defined in ( |75| ) 
have uniformly bounded norms \\e k \\ < 6 for all iteration 
indexes k. 

Proof: We begin eliminating the summation from our 
expression of the Newton error by observing that a telescopic 
property emerges. 

H k d[ N) + g k = H k (- ( D k lB kT D^g^j + g k 

= ^ - (D k - B k ) J2 (D^B^D^J g h 

= (i-jZiD^Bky-iD-^Buy+^gu 

= (B k D~Jr^g k 

We introduce the matrix V G R" xn_1 , made up of n — 1 
orthonormal columns spanning 1 . We observe that VV' = 

I n — — , and since g G l 1 ^ we have g = VV g. Our descent 

n i 

occurs in 1 so we restrict our analysis to this subspace. We 



Lemma 2. Let Assumption Uj hold. Let {l k } be a sequence 
generated by the method mf. For any stepsize rule a k , we 
have \\g k+1 \\ < (1 - a k )\\g k \\ + M 2 La 2 k \\g k \\ 2 + a k \\e k \\ + 
M 2 La 2 \\e k \\ 2 . 

Proof: We consider two vectors w G M. n and z G. We 
let £ be a scalar parameter and define the function j/(£) — 
Vy(w + £z). From the chain rule, it follows that J|y(£) = 
H(w + £z)z. Using the Lipschitz continuity of the residual 
function gradient [cf. Assumption[TJa)], we obtain: g(w+z) — 
g(w) - y(l) - 2/(0) = £ £y(m = fo H ( w + 



have \\V'(B k D^) N+1 9k\ 



\V'(B k D^) N+1 VV'g k \\ < 

n N+l ( TD T\-X\ 



\\V'(B k D^)" +1 V\\\\V'g k \\ < p"^(B k D^)\\g k \\ from 
the triangle inequality and the following definition. For a ma- 
trix X, p{X) is the radius of a disc containing all eigenvalues 
with subunit mag nitude. In this problem p(V (BkD'^V) 
coincides with the largest eigenvalue modulus and p{B k D^ 1 ) 
is the second largest eigenvalue modulus. From ifTTl we have 



p(B k D^) <1- 



1 



nA(G)(diam(G) + l)6 r 



(19) 



where A(G) is the maximum degree of any node in G, 
diam(G) is the diameter of G and 6max is an upper bound 
on dual off diagonal elements of the dual hessian: [H k ]ij < 
femaxVi ^ j. Combining this fact with the assumption that 
g k is upper bounded for all k, the result follows. ■ 
Another preliminary result necessary for the proof of 
Proposition [2] is an iterative relationship between the gradient 
norm ||(/fc+i|| at iteration k + 1 and the norm \\g k \\ at iter- 
ation k. This relationship follows from a multi-dimensional 
extension of the descent lemma (see |3|) as we explain next. 



< 



{H{w + £z)-H(w))zdt 



H(w)zd£, 



o 



< / \\H{w + &)-H{w)\\\\z\\d£ + H{w)z 
Jo 

r 1 l 

< \\z\\ I4\\z\\d€ + H(w)z= -\\z\\ 2 + H(w)z. 

Jo * 

We apply the preceding relation with w = l k and z = a k d k 
and obtain g(l k + a k d k ) - g(l k ) < a k H{l k )d k + ^al\\d k \\ 2 . 
By Eq. ( 15 I, we have H(l k )d k = —g(l k ) + e k . Substituting 
this in the previous relation, this yields g[l k + a k d k ) < 
(1 - a k )g{l k ) + a k e k + | a 2 k \\d k \\ 2 . Moreover, using As- 
sumption [ljb), we have ||4|| 2 = ||fl"(i fc )- 1 (-5(i fc ) + e fc )( |2 
\H(l k )- l \\ 2 \\-g{l k ) + e k \\ 2 <M 2 (2\\g(l k )\\ 2 + 2\\e k \\ 2 ' 



< \\H(l k )-'r\\-g{l k ) + e k r <M 2 [2\\g(l k )\\ 2 + 2 
Combining the above relations, we obtain ||<?(^+i)|| < (1 — 
a fc )||0(Jfc)|| + M 2 La 2 \\g(l k )\\ 2 + a k \\e k \\ + M 2 La 2 \\e k \\ 2 , 
establishing the desired relation. ■ 
The proof of Proposition 2 follows from recursive applica- 
tion of the result in Lemma [2] 

Proof: (Proposition [2]) We show Eq. ( 17 1 using induc- 
tion on the iteration m. Using a k = 1 in the statement 
of Lemma [2] we obtain ||ffk+i|| < M 2 L\\g k \\ 2 + B < 
1/AM 2 L + B, where the second inequality follows from 
the assumption ||gfe|| < 1/2M 2 L. This establishes relation 
( [T7| ) for rn = 1, We next assume that ( fTT) holds for 
some m > 0, and show that it also holds for m + 1. Eq. 

l/AM 2 L 



(17 1 implies that \\g k + r . 



S/AM 2 L. 



< l/UPL + B 
Using the assumption B + M 2 LB 2 < 8/AM 2 L, this yields 
\\g k +m\\ < l + 25/4M 2 L < 1/2M 2 L, where the strict 
inequality follows from S G (0,1/2). Using a k+m = 1 in 



the generalized descent lemma [2] we obtain 



M z L\\g k+m+1 \\ < (M*L\\g k 



B. 



Using Eq. (17 1, this implies that M L\\g k+m+ i\ 



1 M 2 LB r 2 2 " 1 - 1 - 1 
■ 



2 2 



M LB 



M LB. 



M 2 L 2 2m 

Using algebraic manipulations and the assumption B 



M 2 LB 2 < 



||Sfc+m+l|| < 



4]^, this yields 
1 



+ B 



6 (2 2 



-1) 



2 2m+1 M 2 L M 2 L 2 2m+1 

completing the induction. Taking the limit superior in Eq. ( fT7) i 
establishes the final result. ■ 

A. Distributed backtracking line search 

Proposition [2] establishes local quadratic convergence for 
properly selected members of the ADD family. To guarantee 
global convergence we modify ADD to use time varying step 
sizes ckfc selected through distributed backtracking line search 
[4, Algorithm 9.2]. Line search implementation requires com- 
putation of the gradient norm \\gk\\ = Y^i=i 9k ■ This can be 
easily achieved using distributed consensus algorithms, e.g., 
|8|. However, since these consensus algorithms are iterative in 
nature, an approximate norm rjk is computed in lieu of \\gk\\- 
We assume that approximate gradient norms are computed 
with an error not exceeding a given constant 7/2 > 0, 



m-\\9k\\ <7/2 



(20) 



For fixed scalars a £ (0,1/2) and j3 £ (0,1), we set the 
stepsize ak equal to ctk — (3 mk , where m,k is the smallest 
nonnegative integer that satisfies 



rifc+i < (1 - aP m )Vk + B + 7. 



(21) 



The expression in (21 1 coincides with the regular (centralized) 
backtracking line search except for the use of the approximate 
norm rjk instead of the actual norm ||<jrfc|| and the (small) 



additive constants B and 7 respectively defined in (I61 and 
d20l 



While we introduce line search to ensure global conver- 
gence we start by showing that stepsizes selected according 



to the rule in (21 1 do not affect local convergence. As we 
show next, this is because if \\gk\\ < 1/(2M 2 L) as required 
in Proposition [2] the rule in (21 1 selects stepsizes = 1. 

Proposition 3. If at iteration k of ADD- N tlie gradient norm 
satisfies \\gk\\ < 1/(2M 2 L), the inexact backtracking stepsize 
rule in (|20| selects ak = 1. 



Proof: Replacing ctk — 1 in Lemma [2] and using 
the definition of the constant B, we obtain ||g/c+i|| < 
M 2 L\\ 9k \\ 2 + B < i|| Sfc || + B < (1 - a)\\g k \\ + B, where 
to get the last inequality, we used the fact that the constant 
a used in the inexact backtracking stepsize rule satisfies 
a £ (0,1/2). Using the condition on r\k [cf. Eq. (20 1], this 
yields rik+i < (l — tj)r) k + B+y, showing that the steplength 



a k = 1 satisfies condition (21 1 in the inexact backtracking 
stepsize rule. ■ 
As per Proposition [5] if ADD-A with backtracking line 
search is initialized at a point at which ||g || < 1/(2M 2 L), 
stepsizes a k = 1 are used. Therefore, Proposition|2]holds, and 
convergence to the optimum I* is quadratic, which in practice 
implies convergence in a few steps. Otherwise, selecting step 
sizes ctk satisfying pi) ensures a strict decrease in the norm 
of the residual function as proven by the proposition below. 



Proposition 4. C ons ider ADD-N algorithms char acte rized 
by the iteration (12) with as defined in (11 1. Let 

Assumption^hold and stepsizes oik being selected according 
to the inexact backtracking rule in ( |2i| ). Further assume that 
1 1 gfc I > 1 /2M 2 L and that N is chosen large enough to ensure 



(22) 



that the constants B and 7 in (16) and (20) satisfy 

where f3 is the backtracking rule constant and M and L are 
defined in Assumption [7] Then, the gradient norm at iteration 
k + 1 decreases by at least /3/(16M 2 L), 

P 



\9k+i\ 



< 



\9k\ 



16M 2 L' 



Proof: For any k > 0, we define = 



(23) 



2M 2 L( nk +j/2) ■ 



In view of the condition on rjk [cf. Eq. (|20|>], we have 
1 



2Af 2 L(|| 3fe ||+7) 



< Ctk < 



1 



2M 2 L\\ 9k \ 



< 1, 



where the last inequality follows by the assumption \\gk\ 
21 ^2 L ■ Using the preceding relation and substituting otk 
in the generalized descent lemma (lemma |2j, we obtain: 



(24) 
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9k 
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where the second inequality follows from the definition of 
a. k and the third inequality follows by combining the facts 
ctk < L ||efc|| < e for all k, and the definition of B. The 
constant a used in the definition of the inexact backtracking 
line search satisfies a £ (0,1/2), therefore, it follows from 
the preceding relation that ||<?fc+i|| < (1 — cfifc)||5fc|| + B. 
Using condition (20 1 once again, this implies nk+i < (1 — 
cotkjrjk + B + 7, showing that the steplength a k selected by 
the inexact backtracking line search satisfies a k > (3ctk - From 
we have rik+i < (1 — aak)i]k +B + j, which 
< (l-a(3a k )\\gk\\ + S + 27. Combined with 



condition (21 
implies 



9k+i\ 



Eq. (24 1, this yields 



Il5fe+xl 



< 1 



2Af 2 L(|| 5fe ||+7) 



\9k\ 



B + 2j. 



By 



22 



we also have 7 < B + 2j < f3/l6M 2 L, which in view 
of the assumption \\g k \\ > 1/2M 2 L implies that 7 < \\g k \\- 
Substituting this in the preceding relation and using the fact 
ct£ (0,1/2), we obtain \\g k+1 \\ < \\g k \\ -f3/%M 2 L + B + 2 1 . 
Combined with 22j this yields the desired result. ■ 
Proposition |4 shows that if ADD- A is initialized at a point 
with gradient norm ||go|| > 1/2M 2 L we obtain a decrease in 
the norm of the gradient of at least (3/16M 2 L. This holds true 
for all iterations as long as \\g k \\ > l/2M 2 L. This establishes 



Fig. 1. Primal objective (top), f(x k ) and primal feasibility (bottom), 
— 6|| with respect to dual descent iterations for a sample network 
optimization problem with 25 nodes and 75 edges. ADD converges two orders 
of magnitude faster than gradient descent. Increasing N reduces the number 
of iterations required to converge. 



Fig. 2. Primal objective (top), f(x k ) and primal feasibility (bottom), 
— b|| with respect to number of local information exchanges for a 
sample network optimization problem with 25 nodes and 75 edges. ADD 
converges an order of magnitude faster than consensus-based Newton and 
two orders of magnitude faster than gradient descent. 



that we need at most 16||<?o|| M 2 L/{3 iterations until we obtain 
||.9fe|| < 1/2M 2 L. At this point the quadratic convergence 
result in Proposition [2] comes in effect and ADD-/Y converges 
in a few extra steps. 

V. Consensus-based Newton 

An alternative approach to obtain an approximation to the 
Newton step is to use consensus iterations. This consensus- 
based inexact Newton algorithm is pursued in (9) and 
shares a connection with the algorithm proposed here. While 
consensus-based Newton is developed for a primal dual 
formulation, it can be adapted to dual descent as pursued here. 
In consensus-based Newton, approximate Newton directions 
d k are found by solving the following consensus dynamic 

= {D k + I)-\B k + I)df - (D k + J)- 1 ^, 

where the splitting H k = (D k + 1) — {B k + 1) was used. 

Observe that both consensus-based Newton and ADD use 
a choice of splitting. ADD uses D k ,B k as opposed to (D k + 
I), (B k + I) but the choice of splitting is not crucial. With the 
appropriate choice of splitting, the consensus update becomes 

d k ^=D^B k df-D^g k . 

Choosing the initial value til = results in a sequence of 
approximations of the Newton direction as follows 

4 X) = Dk'BkO D- l g k = -D- l g k = -H^g k 
4 2) = D-'B.i-D-'g,) D- 1 g k = -H^g k 

771 — 1 

4 m) = - E D kHD^B k D-^YD-^g k = -sjr-^gk 

We observe that after m consensus iterations our approxima- 
tion d£ is the same approximation arrived at by using ADD- 
N with N — rn — 1. Therefore, ADD has the same behavior 
as a fixed iteration version of consensus-based Newton. 



VI. Numerical results 

Numerical experiments are undertaken to study ADD's 
performance with respect to the choice of the number of 
approximating terms N. These experiments show that N = 1 
or N = 2 work best in practice. ADD is also compared to 
dual gradient descent [14] and the consensus-based Newton 
method in [9| that we summarized in Section [V] These 
comparisons show that ADD convergence times are one to 
two orders of magnitude faster. 

Figures [T] and [2] show convergence metrics for a randomly 
generated network with 25 nodes and 75 edges. Edges in the 
network are selected uniformly at random. The flow vector b 
is chosen to place sources and sinks a full diam(Cf) away from 
each other. All figures show results for ADD-0 through ADD- 
3, gradient descent, and consensus-based Newton. In Fig. [T] 
objective value f[x(l k )] and constraint violation ||Aa:(Zfc) — 6|| 
are shown as functions of the iteration index k. As expected, 
the number of iterations required decreases with increasing 
7Y. The performance improvement, however, is minimal when 
increasing N from 2 to 3. The convergence time of consensus- 
based Newton is comparable to ADD, while gradient descent 
is three orders of magnitude slower. 

The comparison in Fig. [T] is not completely accurate be- 
cause different versions of ADD differ in the number of com- 
munication instances required per iteration. These numbers 
differ for consensus-based Newton and regular dual descent 
as well. Fig. [2] normalizes the horizontal axis to demonstrate 
algorithms' progress with respect to the number of times 
nodes exchange information with their neighbors. Observe 
that in terms of this metric all versions of ADD are about an 
order of magnitude faster than consensus-based Newton and 
two orders orders of magnitude faster than gradient descent. 
The difference between Figs. [T] and [2] is due to the number of 
communication instances needed per each Newton iteration. 

Another important conclusion of Fig. [2] is that even though 
increasing 7Y in ADD decreases the number of iterations 
required, there is not a strict decrease in the number of 
communications. Indeed, as can be appreciated in Fig. [2] 
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Fig. 3. Histogram of the number of local communications required to reach 
||fl(Jfe)|| < 10 -10 for ADD-N with respect to parameter N, for 50 trials of 
the network optimization problem on random graphs with 25 nodes and 75 
edges. ADD-2 is shown to be the best on average by about 10% indicating 
that with respect to communication cost, larger N is not necessarily better. 
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Fig. 4. Min (left), mean (center) and max (right) number of local 
communications required to reach < 10~ 10 for gradient descent, 

consensus-based Newton and ADD, computed for 35 trials each on random 
graphs with 25 nodes and 75 edges(l), 50 nodes and 350 edges(2), and 
100 nodes and 1000 edges (3). The min and max are on the same order of 
magnitude for ADD, demonstrating small variance. 



ADD-2 requires fewer communications than ADD-3. This 
fact demonstrates an inherent trade off between spending 
communication instances to refine the Newton step 
versus using them to take a step. We further examine this 
phenomenon in Fig. [3] These experiments are on random 
graphs with 25 nodes and 75 edges chosen uniformly at 
random. The flow vector b is selected by placing a source 
and a sink at diam(Cf) away from each other. We consider an 
algorithm to have converged when its residual \\gk\\ < 10 -10 . 

The behavior of ADD is also explored for graphs of 
varying size and degree in Fig. |4] As the graph size increases 
the performance gap between ADD and competing methods 
increases. Consistency of ADD is also apparent since the max- 
imum, minimum, and average information exchanges required 
to solve ([TJ for different network realizations are similar. 
This is not the case for neither consensus-based Newton nor 
gradient descent. Further note that ADD's communication 
cost increases only slightly with network size. 



VII. Conclusion 

A family of accelerated dual descent (ADD) algorithms to 
find optimal network flows in a distributed manner was intro- 
duced. Members of this family are characterized by a single 
parameter N determining the accuracy in the approximation 
of the dual Newton step. This same parameter controls the 
communication cost of individual algorithm iterations. We 
proved that it is always possible to find members of this 
family for which convergence to optimal operating points is 
quadratic. 

Simulations demonstrated that N = 1 and N = 2, 
respectively denoted as ADD-1 and ADD-2 perform best in 
practice. ADD-1 corresponds to Newton step approximations 
using information from neighboring nodes only, while ADD- 
2 requires information from nodes two hops away. ADD-1 
and ADD-2 outperform gradient descent by two orders of 
magnitude and a related consensus-based Newton method by 
one order of magnitude. 

Possible extensions include applications to network utility 
maximization Q3Q , general wireless communication problems 
ifTTll . and stochastic settings [16|. 
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